acc: Always destroy deployed bundles on test exit in cloud runs#5585
Open
chrisst wants to merge 2 commits into
Open
acc: Always destroy deployed bundles on test exit in cloud runs#5585chrisst wants to merge 2 commits into
chrisst wants to merge 2 commits into
Conversation
When acceptance tests run against real cloud workspaces (CLOUD_ENV set), a test that fails, times out, or exits mid-script never reaches its own 'bundle destroy' step: scripts run under 'bash -e' and the merged script.cleanup parts are skipped on failure. The deployed resources (SQL warehouses, pipelines, jobs, ...) then leak in the shared test workspaces. Leaked started warehouses recently exhausted a GCP quota and took CI down for two days; we observed 100+ leaked warehouses and dozens of leaked test pipelines in a single workspace. This adds a harness-level safety net: on cloud runs, runTest registers a t.Cleanup (before starting the script, so it also covers timeouts and mid-test failures) that scans the test's temp dir for bundle state directories (<bundle_root>/.databricks/bundle/<target>) and runs '$CLI bundle destroy --auto-approve --target <target>' in each bundle root, reusing the exact environment the script ran with. The mechanism is deliberately best effort and invisible to test output: - It is a no-op for local testserver runs (gated on CLOUD_ENV). - It runs after output comparison and logs only via t.Logf, so cleanup output is never compared against expected out files. - Double-destroy is harmless: 'bundle destroy' on an already-destroyed bundle exits 0 with 'No active deployment found to destroy!'. In the common success path the shared script.cleanup already removed .databricks, so nothing is even attempted. - Destroy failures are logged but never fail the test. Co-authored-by: Isaac
Use context.WithoutCancel(t.Context()) instead of context.Background() (t.Context() is already canceled when cleanups run), make the best-effort nilerr skip explicit, and trim narration comments. Co-authored-by: Isaac
Contributor
Waiting for approvalBased on git history, these people are best suited to review:
Eligible reviewers: Suggestions based on git history. See OWNERS for ownership rules. |
Collaborator
Integration test reportCommit: 2aadb70
60 interesting tests: 38 FAIL, 15 SKIP, 7 KNOWN
Top 25 slowest tests (at least 2 minutes):
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds a harness-level guarantee that bundles deployed by acceptance tests are destroyed when the test exits, for cloud runs (
CLOUD_ENVset):runTestregisters at.Cleanupbefore the script starts (covering failures,requireaborts, and script timeouts), capturing a clone of the exact env the script ran with;.databricks/bundle/<target>state dirs and runs$CLI bundle destroy --auto-approve --target <target>per bundle root (10-minute cap per destroy);t.Logfonly — never fail the test; double-destroy is harmless ("No active deployment found to destroy!" exits 0);Why
Acceptance scripts run under
bash -e, andscript.cleanupfragments are appended after the main body — they never execute when a script fails or times out betweenbundle deployandbundle destroy. Against shared cloud test workspaces this leaks real resources: during the 2026-06-11/12 incident one shared GCP workspace had accumulated 100+ leaked test warehouses and dozens of leakedtest-bundle-pipeline-*pipelines, exhausting the project's local-SSD quota and blocking terraform-provider CI for ~2 days (ref ES-1974228).Cleanup output cannot pollute golden files: it runs after output comparison and goes only to the test log.
Known limitation: destroy is best-effort — a bundle deployed with required
--varflags or a config corrupted mid-test may still fail to destroy; this is logged as a leak warning rather than failing the run.Tests
go build ./...,go vet ./acceptancepass; local deploy+destroy acceptance tests (bundle/resources/sql_warehouses,bundle/resources/pipelines/recreate-keysacross all engine variants) pass with no output regressions.This pull request and its description were written by Isaac.